Machine Translation for Twitter

Abstract

We explored the feasibility of machine translation for Twitter for the language pair English and German. As a first step, we created a small bilingual corpus of 1,000 tweets and used it to analyze the linguistic features of tweets. We tested different domain adaptation strategies and found that they improved translation performance. Our experiments showed large differences in performance depending on how unknown words were handled; by using XML markup, we were able to reduce this difference. We also replaced special Twitter expressions with placeholders, which enabled us to learn more robust n-gram statistics from Twitter data. We carried out a small-scale human evaluation to complement our automatic scores. Finally, we tested strategies to enforce translation output of legal length. Generating n-best lists of translation candidates and searching them for legal tweets was helpful, but ultimately too unreliable because there was no systematic way to determine the required value of n. We suggest a feature function based on character count as a potential solution.
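Two of the steps described in the abstract, placeholder replacement and the n-best search for a tweet of legal length, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The placeholder names (`__URL__`, `__USER__`, `__HASHTAG__`), the regular expressions, and the function names are hypothetical, and the 140-character limit is an assumption based on Twitter's limit at the time of publication.

```python
import re

# Assumed inventory of "special Twitter expressions": URLs, @-mentions, hashtags.
# The abstract does not list the expressions or placeholder tokens actually used.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://\S+)"
    r"|(?P<USER>(?<!\w)@\w+)"
    r"|(?P<HASHTAG>(?<!\w)#\w+)"
)

# Assumption: the legal tweet length at the time was 140 characters.
MAX_TWEET_LENGTH = 140


def mask_tweet(text):
    """Replace special expressions with placeholders in a single pass.

    Returns the masked text and the original expressions in order of
    appearance, so they could be restored after translation.
    """
    originals = []

    def _substitute(match):
        originals.append(match.group(0))
        # lastgroup is the name of the alternative that matched.
        return f"__{match.lastgroup}__"

    return TOKEN_RE.sub(_substitute, text), originals


def first_legal_candidate(n_best, max_length=MAX_TWEET_LENGTH):
    """Return the highest-ranked candidate that fits the length limit.

    n_best is assumed to be ordered best-first. Returns None when no
    candidate fits, which is the failure case the abstract describes:
    the required n cannot be known in advance.
    """
    for candidate in n_best:
        if len(candidate) <= max_length:
            return candidate
    return None
```

Masking collapses many distinct surface forms into a few tokens, which is one plausible reading of why n-gram statistics become more robust. The `None` return of `first_legal_candidate` makes the unreliability of a fixed n explicit.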


Similar resources

Statistical Machine Translation for Twitter

We consider the problem of translating short messages (Tweets) using Europarl as a starting-point. After highlighting some of the domain differences between Europarl and Twitter, we show that for German-English translation, we can improve performance from a baseline BLEU score of 25.58 to 53.45. By far and away the single most important improvement is passing-through unknown words (which are ma...



A High-Performance Model based on Ensembles for Twitter Sentiment Classification

Background and Objectives: Twitter sentiment classification is one of the most popular fields in information retrieval and text mining. Millions of people around the world use social networks like Twitter intensively. Twitter lets users publish tweets to share what they think about various topics. Numerous web sites on the Internet present Twitter content. The user can enter a sentiment ta...


Linguistic steganography on Twitter: hierarchical language modeling with manual interaction

This work proposes a natural language stegosystem for Twitter, modifying tweets as they are written to hide 4 bits of payload per tweet, which is a greater payload than previous systems have achieved. The system, CoverTweet, includes novel components, as well as some already developed in the literature. We believe that the task of transforming covers during embedding is equivalent to unilingual...


Does 'well-being' translate on Twitter?

We investigate whether psychological wellbeing translates across English and Spanish Twitter, by building and comparing source language and automatically translated weighted lexica in English and Spanish. We find that the source language models perform substantially better than the machine translated versions. Moreover, manually correcting translation errors does not improve model performance, ...



Publication date: 2010